June 25, 2014

Why "Reproducible Science" Talk?

  • Training for Ocean Health Index regionalization: Canada, China, Israel, Baltic, S America

  • Topic from local Meetups: R, Data Science

  • Why I love Github, R and RStudio

Outline

  • Ocean Health Index
  • Toolbox
  • Github
  • RStudio
  • Data wrangling

What is a Healthy Ocean?

  • Is it pristine?
  • "A healthy ocean sustainably delivers a range of benefits to people now and in the future."

Dimensions

Toolbox Goals

  • Recalculate OHI globally or regionally using alternative weights, equations, layers, etc.
  • Regionalize based on administrative boundaries finer than EEZ.
  • Visualize results to highlight best opportunities for improving ocean health.
  • Interface with easy-to-use forms for sliding weights and concocting scenarios.
  • Automate with tools for manipulating input layers and calculating OHI scores for sensitivity analyses.

Regionalize

US West Coast Halpern et al (2014) PLoS ONE

Brazil Elfes et al (2014) PLoS ONE

Visualize

Flower

Visualize

Map

Process

Toolbox

Regionalization Strategy

  • examples
  • globally
    • political: Global Administrative Areas (GADM)
    • biogeographic: Marine Ecoregions of the World (MEOW)
    • data:
      • pressures: extract from 1km Cumulative Impact rasters (Halpern et al 2008, Halpern et al in draft)
      • other: weight country values from ohi-global by area / coastal population / … of region
    • populate ohi-[country] scenario repository
    • deploy to ShinyApps.io for interactive website
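
One plausible reading of "weight country values ... by area" above is to split a country-level value among its regions in proportion to each region's area. A minimal sketch in base R, assuming a hypothetical country total and made-up region areas (none of these names or numbers come from ohi-global):

```r
# Hypothetical regions within one country, with made-up areas (km2)
regions <- data.frame(
  region_id = 1:3,
  area_km2  = c(500, 300, 200))

# Hypothetical country-level total taken from a global layer
country_value <- 1000

# Share of the country total assigned to each region, proportional to area
regions$weight <- regions$area_km2 / sum(regions$area_km2)
regions$value  <- country_value * regions$weight

print(regions)
```

Swapping area_km2 for a coastal population column would give the population-weighted variant; the regional values still sum to the country total.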

Scenario files

  • layers.csv, layers/
    • *.csv
  • scenario.R, conf/
    • config.R
    • pressures_matrix.csv, resilience_matrix.csv, resilience_weights.csv
    • goals.csv
    • functions.R
  • spatial/regions_gcs.js
  • launchApp_code.R, launchApp.bat (Win), launchApp.command (Mac)
  • scores.csv
  • results/report.html, /figures

Simulation

For example, calculate the Baltic Health Index for every year, with scenarios bhi1980, ..., bhi2014 stored as folders.

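library(ohicore)

# one scenario folder per year: bhi1980, ..., bhi2014
for (dir_scenario in sprintf('~/ohibaltic/bhi%d', 1980:2014)){
  setwd(dir_scenario)
  
  # load the scenario configuration and input layers, then score
  conf   = Conf('conf')
  layers = Layers('layers.csv', 'layers')
  scores = CalculateAll(conf, layers)
  
  # save alongside the scenario, without the row-name column
  write.csv(scores, 'scores.csv', row.names=FALSE)
}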
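
Once each scenario folder holds its own scores.csv, the yearly results could be stacked into one table for plotting trends. A rough sketch in base R that does not need ohicore: it first writes placeholder scores.csv files into a temporary folder tree, so the column layout (goal, dimension, region_id, score) and all values are assumptions for illustration only.

```r
# Build a throwaway folder tree that mimics the scenario folders.
# The scores.csv columns and values are hypothetical placeholders.
dir_root <- file.path(tempdir(), 'ohibaltic')
years    <- 1980:1982
for (yr in years){
  dir_scenario <- file.path(dir_root, sprintf('bhi%d', yr))
  dir.create(dir_scenario, recursive = TRUE, showWarnings = FALSE)
  write.csv(
    data.frame(goal = 'Index', dimension = 'score',
               region_id = 0, score = 60 + (yr - 1980)),
    file.path(dir_scenario, 'scores.csv'), row.names = FALSE)
}

# Read each year's scores.csv, tag it with its year, and stack the rows
scores_all <- do.call(rbind, lapply(years, function(yr){
  d <- read.csv(file.path(dir_root, sprintf('bhi%d', yr), 'scores.csv'))
  d$year <- yr
  d
}))

print(scores_all)
```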

Software choices for reproducible science

free, cross-platform, open source, web based:

  • csv (comma-separated values) data files; ancillary formats: md, json, shp, geotiff
    • Excel handles Unicode poorly and locks open files; try OpenOffice instead.
  • R, with libraries such as shiny (web applications), ggplot2 (figures) and dplyr (data manipulation)
  • Github repositories:
    • backup to offsite archive, and rewind changes
    • document changes of code and files with issues and messages
    • collaborate with others and publish to web site

Github Repositories

ohiprep | ohi-[scenario] | ohicore

OHI for Github

Fork and Pull Model

org web: github.com/[org]/[repo] | user web: github.com/[user]/[repo] | user local: ~/github/[repo]
1x: org web -> fork -> user web -> clone -> user local
org web <- merge pull request {admin} <- pull request <- user web <- push <- user local (<-> commit)

where:

  • [org] is an organization (eg ohi-science)
  • [repo] is a repository in the organization (eg ohiprep)
  • [user] is your github username (eg bbest)

Github Features

  • track changes, issues, etc.; free for public repos
  • limits: 1 GB per repo, 100 MB per file, so larger (and binary) files go on a file server, with a remote VPN option
  • render markdown, eg README.md

Other Github Features

Track Changes View with "Rendered" button to view differences between versions of a text file: additions in green, removals in red strikethrough.

Other Github Features

CSV View allows for on the fly tabular view, searching for text, and linking to specific rows of data.

Other Github Features

Geographic View of GeoJSON renders automatically as a map.

RStudio: Documenting with Markdown

  • markdown is a plain text formatting syntax for conversion to HTML (with a tool)

  • R Markdown enables easy authoring of reproducible web reports from R

  • in RStudio

Embedding R code

  • chunks: text, tables, figures

Embedding R code

  • inline: pi=`r pi` evaluates to "pi=3.1416"

Embedding equations

  • inline

The arithmetic mean is equal to $\frac{1}{n} \sum_{i=1}^{n} x_{i}$, or the summation of n numbers divided by n.

The arithmetic mean is equal to \(\frac{1}{n} \sum_{i=1}^{n} x_{i}\), or the summation of n numbers divided by n.

Embedding equations (2)

  • chunked
$$
\frac{1}{n} \sum_{i=1}^{n} x_{i}
$$

\[ \frac{1}{n} \sum_{i=1}^{n} x_{i} \]
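
As a quick check of the arithmetic mean formula, the sum divided by n can be compared against R's built-in mean(). A minimal sketch with arbitrary example values:

```r
x <- c(2, 4, 6, 8)           # arbitrary example values
n <- length(x)

mean_manual <- sum(x) / n    # (1/n) times the sum of x_i for i = 1..n

print(mean_manual)
print(mean(x))               # built-in equivalent
```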

Online friendly

Github in RStudio

RStudio: File > New Project > Version Control

  • clone

Github in RStudio (2)

  • commit and push

Github in RStudio (3)

data wrangling with dplyr

  • dplyr is the next iteration of plyr, focussed on tools for working with data frames.

data wrangling task

Calculate the batting average (AVG): number of base hits (H) divided by the total number of at bats (AB) using the Lahman baseball database. Limit to Babe Ruth and Jackie Robinson. The code here takes the mean of H/AB over each player's season records.

  • setup
library(Lahman)
library(dplyr)
library(RSQLite)
  • answer
nameFirst nameLast    avg
      Babe      Ruth  0.323
    Jackie  Robinson  0.308

data wrangling: sql

  • sql
# lahman_sqlite() shipped with dplyr in 2014 and later moved to dbplyr;
# newer Lahman releases name the Master table People
tbl(lahman_sqlite(), sql(
"SELECT nameFirst, nameLast, 
  ROUND(AVG(H/(AB*1.0)), 3) AS avg 
FROM Batting
JOIN Master USING (playerID)
WHERE AB > 0 AND (
  (nameFirst = 'Babe' AND 
   nameLast = 'Ruth') OR 
  (nameFirst = 'Jackie' AND 
   nameLast = 'Robinson')) 
GROUP BY nameFirst, nameLast
ORDER BY avg DESC"))

data wrangling: dplyr

  • chaining (%>%, written %.% in early dplyr releases): grammar of data manipulation
Batting %>%
  inner_join(Master, by='playerID') %>%
  # outer parentheses keep AB > 0 applied to both players (& binds tighter than |)
  filter(
    AB > 0 &
    ((nameFirst=='Babe' & 
      nameLast =='Ruth') | 
     (nameFirst=='Jackie' & 
      nameLast =='Robinson'))) %>%
  group_by(nameFirst, nameLast) %>%
  summarise(avg = round(mean(H/AB), 3)) %>%
  arrange(desc(avg))
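
Worth noting: mean(H/AB) over season rows averages the per-season ratios, which generally differs from total hits divided by total at bats. A small sketch in base R with a made-up two-season record (the numbers are hypothetical, not from the Lahman database) shows the gap:

```r
# Hypothetical two-season record for one player (made-up numbers)
seasons <- data.frame(
  H  = c(10, 150),
  AB = c(20, 500))

# Mean of the per-season ratios, as in mean(H/AB)
avg_of_seasons <- mean(seasons$H / seasons$AB)

# Pooled ratio: total hits divided by total at bats
avg_pooled <- sum(seasons$H) / sum(seasons$AB)

print(round(c(avg_of_seasons = avg_of_seasons, avg_pooled = avg_pooled), 3))
```

The short first season pulls the mean of ratios upward; the pooled ratio weights each season by its at bats. In dplyr terms the pooled version would be summarise(avg = sum(H)/sum(AB)).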

For More…